Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls

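Workflow Overview

This workflow contains 23 nodes, 11 of which are sticky notes that document the canvas. It turns a Screaming Frog crawl export into an llms.txt file.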

Workflow Source Code

{
  "id": "",
  "meta": {
    "instanceId": "",
    "templateCredsSetupCompleted": true
  },
  "name": "Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls",
  "tags": [],
  "nodes": [
    {
      "id": "ca701618-b2d5-48ee-a503-d3513d018a65",
      "name": "Sticky Note",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        360,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Form - Screaming Frog internal_html.csv upload  

This form node is used to trigger the workflow.  

It contains **three input fields**:  
- Name of the website  
- Short description of the website  
- **Screaming Frog** export containing the internal URLs  



It is recommended to use the **internal_html.csv** export, but **internal_all.csv** will also work, as the workflow includes a filter to process only indexable URLs.
"
      },
      "typeVersion": 1
    },
    {
      "id": "bc040ca0-f38d-4458-a60c-17f71dbfd1ea",
      "name": "Sticky Note1",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        780,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Extract data from Screaming Frog file

This node extracts data from the **CSV file** provided by the user.  

It produces an output that is **easily usable** in the following nodes.  

⚠️ **Caution:**  
If the uploaded file is **not** the expected Screaming Frog export, the workflow will still proceed but will likely **fail in the next steps** due to missing required fields.  

"
      },
      "typeVersion": 1
    },
    {
      "id": "f71a7d10-847d-48e7-8820-ec0c1e7ea055",
      "name": "Sticky Note2",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1200,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Useful Fields  

This node sets **7 key fields** from the Screaming Frog export:  

- `url` → from the **\"Address\"** column  
- `title` → from the **\"Title 1\"** column  
- `description` → from the **\"Meta Description 1\"** column  
- `statut` → from the **\"Status Code\"** column  
- `indexability` → from the **\"Indexability\"** column  
- `content_type` → from the **\"Content Type\"** column  
- `word_count` → from the **\"Word Count\"** column  


**Multi-language compatibility**  
If you're using Screaming Frog in **French, Italian, German, or Spanish**, the column names will be different.  
However, the workflow is designed to handle this, so it will **still work correctly**! 🥳
"
      },
      "typeVersion": 1
    },
    {
      "id": "6f6546b8-adeb-4998-ae19-d93525337eb7",
      "name": "Set useful fields",
      "type": "n8n-nodes-base.set",
      "position": [
        1340,
        60
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "0e7d4a06-83fc-4834-93fe-2e758cbe2307",
              "name": "url",
              "type": "string",
              "value": "={{ $json.Address || $json.Adresse || $json.Dirección || $json.Indirizzo }}"
            },
            {
              "id": "c82f4d4c-9d0b-4c7d-9647-5d0240b58643",
              "name": "title",
              "type": "string",
              "value": "={{ $json['Title 1'] || $json['Titolo 1'] || $json['Título 1'] || $json['Titel 1'] }}"
            },
            {
              "id": "abea81db-ce3b-4ac1-bd21-09ccfffb567a",
              "name": "description",
              "type": "string",
              "value": "={{ $json['Meta Description 1'] || $json['Meta description 1'] }}"
            },
            {
              "id": "2ca75d74-70f8-400b-b862-9da186135915",
              "name": "statut",
              "type": "string",
              "value": "={{ $json['Status Code'] || $json['Code HTTP'] || $json['Status-Code'] || $json['Código de respuesta'] || $json['Codice di stato']}}"
            },
            {
              "id": "754d3202-38b0-4d79-ba24-8078b3244307",
              "name": "indexability",
              "type": "string",
              "value": "={{ $json.Indexability || $json.Indexabilité || $json.Indicizzabilità || $json.Indexabilidad || $json.Indexierbarkeit}}"
            },
            {
              "id": "8bc6583d-bb34-4d22-b310-fe79bb8ac85d",
              "name": "content_type",
              "type": "string",
              "value": "={{ $json['Content Type'] || $json['Type de contenu'] || $json['Tipo di contenuto'] || $json['Tipo de contenido'] || $json['Inhaltstyp']}}"
            },
            {
              "id": "c874ba1a-769e-43d3-9555-8c9914ca9b76",
              "name": "word_count",
              "type": "string",
              "value": "={{ $json['Word Count'] || $json['Nombre de mots'] || $json['Conteggio delle parole'] || $json['Recuento de palabras'] || $json['Wortanzahl'] }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "1a9af7a0-d2d5-44cb-9770-2d5a1e5706f4",
      "name": "Text Classifier",
      "type": "@n8n/n8n-nodes-langchain.textClassifier",
      "disabled": true,
      "position": [
        2260,
        60
      ],
      "parameters": {
        "options": {},
        "inputText": "=url : {{ $json.url }}
title : {{ $json.title }}
description : {{ $json.description }}
words count : {{ $json.word_count }}",
        "categories": {
          "categories": [
            {
              "category": "useful_content",
              "description": "Pages that are likely to contain high-quality content, making them suitable for inclusion in a file that aids content discovery for an LLM. "
            },
            {
              "category": "other_content",
              "description": "Pages that should not be included (e.g., pagination, or low-value content)."
            }
          ]
        }
      },
      "typeVersion": 1
    },
    {
      "id": "74a4e378-4228-4142-92ca-e541efde2b15",
      "name": "OpenAI Chat Model",
      "type": "@n8n/n8n-nodes-langchain.lmChatOpenAi",
      "position": [
        2180,
        240
      ],
      "parameters": {
        "model": {
          "__rl": true,
          "mode": "list",
          "value": "gpt-4o-mini"
        },
        "options": {}
      },
      "credentials": {
        "openAiApi": {
          "id": "",
          "name": "OpenAi Connection"
        }
      },
      "typeVersion": 1.2
    },
    {
      "id": "63dc6cfe-bc73-43b5-8c7d-4f5fd6501d3b",
      "name": "No Operation, do nothing",
      "type": "n8n-nodes-base.noOp",
      "position": [
        2580,
        200
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "cb555b99-9e63-4b6b-a1fc-512b5467d666",
      "name": "Sticky Note3",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        1620,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Filter URLs 

This **filter node** is used to keep only the URLs that meet the following conditions:  
- `statut` = **200**  
- `indexability` = **Indexable** (or the Italian or German value, **Indicizzabile** or **Indexierbar**)  
- `content_type` contains **text/html**  


These filters are even **more useful** if the uploaded file is an **internal_all.csv** instead of an **internal_html.csv**.  

### **Tips:**  
You can **add more filters** to refine the URLs included in your `llms.txt` file.  

💡 **Examples:**  
- **Filter by word count** → Ensure pages contain **enough text content**.  
- **Filter by URL path** → Keep only **specific folders or categories** in the `llms.txt` file.  
- **Filter by meta description** → Exclude URLs **without a meta description**, as this field will be used in the `llms.txt` file to describe each piece of content.  
"
      },
      "typeVersion": 1
    },
    {
      "id": "e34e56e2-5cc8-4e50-bfb0-3aa2e5e04abf",
      "name": "Filter URLs",
      "type": "n8n-nodes-base.filter",
      "position": [
        1740,
        60
      ],
      "parameters": {
        "options": {},
        "conditions": {
          "options": {
            "version": 2,
            "leftValue": "",
            "caseSensitive": true,
            "typeValidation": "strict"
          },
          "combinator": "and",
          "conditions": [
            {
              "id": "cef4feaa-1c46-45b1-92b7-f5c2051b1dc5",
              "operator": {
                "type": "number",
                "operation": "equals"
              },
              "leftValue": "={{ Number($json.statut) }}",
              "rightValue": 200
            },
            {
              "id": "bb821656-9740-4da4-8aa9-f65ad098c470",
              "operator": {
                "type": "boolean",
                "operation": "true",
                "singleValue": true
              },
              "leftValue": "={{ [\"Indexable\", \"Indicizzabile\", \"Indexierbar\"].includes($json.indexability) }}",
              "rightValue": "={{ \"Indexable\" || \"Indicizzabile\" }}"
            },
            {
              "id": "5c93ddb8-8091-406a-bc04-fa14e8b73fb9",
              "operator": {
                "type": "string",
                "operation": "contains"
              },
              "leftValue": "={{ $json.content_type }}",
              "rightValue": "text/html"
            }
          ]
        }
      },
      "typeVersion": 2.2
    },
    {
      "id": "b98f19a8-afd3-4d26-8063-dee3ee75055f",
      "name": "Sticky Note4",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2040,
        -800
      ],
      "parameters": {
        "color": 2,
        "width": 740,
        "height": 1160,
        "content": "## Text Classifier

🚫 **This node is deactivated by default** in the template.  

You can **enable it** if you want to add a more **\"intelligent\" 🤓 filter** to refine the URLs included in the `llms.txt` file, helping LLMs discover and prioritize valuable content.

### How It Works:
This node has **two outputs**:  
- **`useful_content`** → Pages that are **likely to contain high-quality content**, making them suitable for inclusion in a file that **aids content discovery for an LLM**.  
- **`other_content`** → Pages that should **not** be included (e.g., pagination or low-value content).  


You can **modify the description** in the node to fine-tune the classification according to your needs.  

### Input Fields:
- **url** → `{{ $json.url }}`  
- **title** → `{{ $json.title }}`  
- **description** → `{{ $json.description }}`  
- **word_count** → `{{ $json.word_count }}`  

### Why use an LLM?  
A **language model (LLM)** can **analyze** the **URL, title, and description** to identify pages that **most likely contain meaningful and relevant content**.  
This allows it to **prioritize valuable pages** and structure the data for **better content discovery and training purposes**. 

### **For large websites**  
If you have a **very large website**, consider using a **Loop Over Items** node to make the workflow **more robust** and ensure all pages are processed.  
Also, using a **Loop Over Items** node makes it **easier** to handle:  
- **Timeouts** 
- **API quotas** 
- **Other scalability issues**

### Token usage
Finally, keep in mind that **more pages mean more tokens and more billed LLM API calls**.







"
      },
      "typeVersion": 1
    },
    {
      "id": "63e3ea7a-cec3-442c-9812-771def0a9949",
      "name": "Sticky Note5",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        2840,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Field - llms.txt Row

This node **sets** the row format for the `llms.txt` file.  

### Row Structure:
Each row follows this format:  

- `- [title](link): description`  

If the URL **has no description** (from the **Meta Description** in the Screaming Frog export), the row will be:  

- `- [title](link)`  
"
      },
      "typeVersion": 1
    },
    {
      "id": "78f58220-feb5-4044-b994-39a0e4f1e9e4",
      "name": "Sticky Note6",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3260,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Summarize - Concatenate

This node concatenates all the output from the previous node, ensuring each row is on a separate line."
      },
      "typeVersion": 1
    },
    {
      "id": "7a119633-7cd3-4de5-a1cd-7f708e1abf4a",
      "name": "Sticky Note7",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        3680,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Set Fields - llms.txt Content

This node sets the content of the `llms.txt` file using:

- The **website title** provided in the form (first node).
- The **website description** provided in the form (first node).
- The output from the previous node, which includes all the URLs, their titles, and their descriptions that will appear in the `llms.txt` file.
"
      },
      "typeVersion": 1
    },
    {
      "id": "554f6858-68e8-4b35-a6c4-21bed6832323",
      "name": "Sticky Note8",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        4100,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## Generate llms.txt file

This node **creates** the `llms.txt` file, which can be **downloaded directly** within n8n. 
"
      },
      "typeVersion": 1
    },
    {
      "id": "24bdefba-e2f2-41f0-93e7-9f8d2fc11f43",
      "name": "Sticky Note9",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        4520,
        -500
      ],
      "parameters": {
        "color": 7,
        "width": 360,
        "height": 860,
        "content": "## upload file anywhere

Instead of downloading the file directly from the n8n workflow, you can **replace this node** with a Drive node (e.g., **Google Drive** or **OneDrive**) to upload the `llms.txt` file to a folder of your choice.  
  
**Name the file properly** (e.g., include the website name) to make it easier to find and distinguish between files when working on multiple websites.  
"
      },
      "typeVersion": 1
    },
    {
      "id": "a3be51e3-810c-40a7-a996-98a3d383c2b9",
      "name": "Summarize - Concatenate",
      "type": "n8n-nodes-base.summarize",
      "position": [
        3380,
        40
      ],
      "parameters": {
        "options": {},
        "fieldsToSummarize": {
          "values": [
            {
              "field": "llmTxtRow",
              "separateBy": "
",
              "aggregation": "concatenate"
            }
          ]
        }
      },
      "typeVersion": 1.1
    },
    {
      "id": "8d3a892a-3d11-4d8a-8ec6-84f8f3af1183",
      "name": "Set Fields - llms.txt Content",
      "type": "n8n-nodes-base.set",
      "position": [
        3820,
        40
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "97062a99-e944-4e1e-89b1-62cf9e3462dd",
              "name": "llmsTxtFile",
              "type": "string",
              "value": "=# {{ $('Form - Screaming frog internal_html.csv upload').item.json['What is the name of your website?'] }}
> {{ $('Form - Screaming frog internal_html.csv upload').item.json['Can you provide a short description of your website? (in the language of the website)'] }}

{{ $json.concatenated_llmTxtRow }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "bc2a692a-47ea-4bf1-a102-e607fd544158",
      "name": "upload file anywhere",
      "type": "n8n-nodes-base.noOp",
      "position": [
        4640,
        40
      ],
      "parameters": {},
      "typeVersion": 1
    },
    {
      "id": "404510a2-35b2-44cf-9d02-eb0abcf4e9b3",
      "name": "Set Field - llms.txt Row",
      "type": "n8n-nodes-base.set",
      "position": [
        2960,
        40
      ],
      "parameters": {
        "options": {},
        "assignments": {
          "assignments": [
            {
              "id": "95e75caa-8110-476b-9cb1-73c15361fa56",
              "name": "llmTxtRow",
              "type": "string",
              "value": "=- [{{ $json.title }}]({{ $json.url }}){{ $json.description ? ': ' + $json.description : '' }}"
            }
          ]
        }
      },
      "typeVersion": 3.4
    },
    {
      "id": "f54d51f2-17bc-4c58-b177-0e77e16f7b72",
      "name": "Sticky Note10",
      "type": "n8n-nodes-base.stickyNote",
      "position": [
        -420,
        -1020
      ],
      "parameters": {
        "color": 5,
        "width": 700,
        "height": 1380,
        "content": "# Generate AI-Ready llms.txt Files from Screaming Frog Website Crawls  

This workflow helps you generate an **llms.txt** file (if you're unfamiliar with it, check out [this article](https://towardsdatascience.com/llms-txt-414d5121bcb3/)) using a **Screaming Frog export**.  

[Screaming Frog](https://www.screamingfrog.co.uk/seo-spider/) is a well-known website crawler.  
After crawling a website, export the **\"internal_html\"** section in CSV format.  

## How It Works: 

A **form** allows you to enter:  
- The **name of the website**  
- A **short description**  
- The **internal_html.csv** file from your Screaming Frog export  


Once the form is submitted, the **workflow is triggered automatically**, and you can **download the llms.txt file directly from n8n**. 

## Downloading the File
Since the last node in this workflow is a **\"Convert to File\"** node (named **\"Generate llms.txt file\"**), you will need to **download the file directly from the n8n UI**.  
However, you can easily **add a node** (e.g., Google Drive, OneDrive) to automatically upload the file **wherever you want**.  

## AI-Powered Filtering (Optional):  
This workflow includes a **text classifier node**, which is **deactivated by default**.  
- You can **activate it** to apply a more **intelligent filter** to select URLs for the `llms.txt` file.  
- Consider modifying the **description** in the classifier node to specify the type of URLs you want to include.  

## How to Use This Workflow  

1. **Crawl the website** you want to generate an `llms.txt` file for using **Screaming Frog**.  
2. **Export the \"internal_html\"** section in CSV format.  
   ![Screaming Frog internal html export](https://i.imgur.com/M0nJQiV.png)  
3. In **n8n**, click **\"Test Workflow\"**, fill in the form, and **upload** the `internal_html.csv` file.  
4. Once the workflow is complete, go to the **\"Generate llms.txt file\"** node and **download the output**.  

**That's it! You now have your llms.txt file!**  



**Recommended Usage:**  
Use this workflow **directly in the n8n UI by clicking** 'Test Workflow' and uploading the file in the form."
      },
      "typeVersion": 1
    },
    {
      "id": "e33104af-802a-43f2-b26d-f368f7de2fd7",
      "name": "Form - Screaming frog internal_html.csv upload",
      "type": "n8n-nodes-base.formTrigger",
      "position": [
        460,
        60
      ],
      "webhookId": "8791f39a-3d81-405c-b177-0a733ebf74cb",
      "parameters": {
        "options": {
          "buttonLabel": "Get the llms.txt file"
        },
        "formTitle": "llms.txt Generator - From Screaming Frog export",
        "formFields": {
          "values": [
            {
              "fieldLabel": "What is the name of your website?",
              "placeholder": "Example : The best website ever",
              "requiredField": true
            },
            {
              "fieldLabel": "Can you provide a short description of your website? (in the language of the website)",
              "placeholder": "Example : This is the best website ever because all the content is engaging and valuable.",
              "requiredField": true
            },
            {
              "fieldType": "file",
              "fieldLabel": "screaming_frog_export",
              "multipleFiles": false,
              "requiredField": true,
              "acceptFileTypes": ".csv"
            }
          ]
        },
        "responseMode": "lastNode",
        "formDescription": "Generate a simple llms.txt file from a Screaming Frog Export
It is recommended to use the internal_html.csv export, although internal_all.csv will also work.

Just fill in the fields in this form 😄"
      },
      "typeVersion": 2.2
    },
    {
      "id": "f6b17fdd-a098-411e-8d53-3f6e638cc3ba",
      "name": "Extract data from Screaming Frog file",
      "type": "n8n-nodes-base.extractFromFile",
      "position": [
        900,
        60
      ],
      "parameters": {
        "options": {},
        "operation": "xls",
        "binaryPropertyName": "screaming_frog_export"
      },
      "typeVersion": 1
    },
    {
      "id": "6bbd8d1f-3322-4c6d-af08-c842386239ce",
      "name": "Generate llms.txt file",
      "type": "n8n-nodes-base.convertToFile",
      "position": [
        4220,
        40
      ],
      "parameters": {
        "options": {
          "encoding": "utf8",
          "fileName": "llms.txt"
        },
        "operation": "toText",
        "sourceProperty": "llmsTxtFile"
      },
      "typeVersion": 1.1
    }
  ],
  "active": false,
  "pinData": {},
  "settings": {
    "executionOrder": "v1"
  },
  "versionId": "",
  "connections": {
    "Filter URLs": {
      "main": [
        [
          {
            "node": "Text Classifier",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Text Classifier": {
      "main": [
        [
          {
            "node": "Set Field - llms.txt Row",
            "type": "main",
            "index": 0
          }
        ],
        [
          {
            "node": "No Operation, do nothing",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "OpenAI Chat Model": {
      "ai_languageModel": [
        [
          {
            "node": "Text Classifier",
            "type": "ai_languageModel",
            "index": 0
          }
        ]
      ]
    },
    "Set useful fields": {
      "main": [
        [
          {
            "node": "Filter URLs",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Generate llms.txt file": {
      "main": [
        []
      ]
    },
    "Summarize - Concatenate": {
      "main": [
        [
          {
            "node": "Set Fields - llms.txt Content",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set Field - llms.txt Row": {
      "main": [
        [
          {
            "node": "Summarize - Concatenate",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Set Fields - llms.txt Content": {
      "main": [
        [
          {
            "node": "Generate llms.txt file",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Extract data from Screaming Frog file": {
      "main": [
        [
          {
            "node": "Set useful fields",
            "type": "main",
            "index": 0
          }
        ]
      ]
    },
    "Form - Screaming frog internal_html.csv upload": {
      "main": [
        [
          {
            "node": "Extract data from Screaming Frog file",
            "type": "main",
            "index": 0
          }
        ]
      ]
    }
  }
}
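
The data flow encoded in the JSON above (map export columns to fields, filter, format one row per page, join the rows under a header) can be sketched in plain JavaScript. This is only an illustration under assumptions: the function names `pick`, `normalize`, `keep`, `toRow`, and `buildLlmsTxt` are hypothetical and do not exist in n8n, and the sample rows are invented. The column-name variants, the filter conditions, and the row template are copied from the "Set useful fields", "Filter URLs", "Set Field - llms.txt Row", and "Set Fields - llms.txt Content" nodes.

```javascript
// Hypothetical helpers; the workflow does this with Set, Filter and Summarize nodes.

// Return the first truthy value among the candidate column names (mirrors the `||` chains).
function pick(row, names) {
  for (const name of names) {
    if (row[name]) return row[name];
  }
  return undefined;
}

// Map localized Screaming Frog columns to the workflow's field names.
function normalize(row) {
  return {
    url: pick(row, ['Address', 'Adresse', 'Dirección', 'Indirizzo']),
    title: pick(row, ['Title 1', 'Titolo 1', 'Título 1', 'Titel 1']),
    description: pick(row, ['Meta Description 1', 'Meta description 1']),
    statut: pick(row, ['Status Code', 'Code HTTP', 'Status-Code', 'Código de respuesta', 'Codice di stato']),
    indexability: pick(row, ['Indexability', 'Indexabilité', 'Indicizzabilità', 'Indexabilidad', 'Indexierbarkeit']),
    content_type: pick(row, ['Content Type', 'Type de contenu', 'Tipo di contenuto', 'Tipo de contenido', 'Inhaltstyp']),
  };
}

// Values accepted by the "Filter URLs" node (case-sensitive).
const INDEXABLE = ['Indexable', 'Indicizzabile', 'Indexierbar'];

// Keep only pages with status 200 that are indexable HTML.
function keep(page) {
  return Number(page.statut) === 200
    && INDEXABLE.includes(page.indexability)
    && String(page.content_type || '').includes('text/html');
}

// One llms.txt row: "- [title](link): description", or without the description if it is missing.
function toRow(page) {
  return `- [${page.title}](${page.url})${page.description ? ': ' + page.description : ''}`;
}

// Assemble the file: H1 with the site name, blockquote with the description, then the rows.
function buildLlmsTxt(siteName, siteDescription, rows) {
  const body = rows.map(normalize).filter(keep).map(toRow).join('\n');
  return `# ${siteName}\n> ${siteDescription}\n\n${body}`;
}

// Invented sample rows: an English row, a redirected row, and a German row.
const sample = [
  { Address: 'https://example.com/', 'Title 1': 'Home', 'Meta Description 1': 'Welcome',
    'Status Code': '200', Indexability: 'Indexable', 'Content Type': 'text/html; charset=UTF-8' },
  { Address: 'https://example.com/old', 'Title 1': 'Old',
    'Status Code': '301', Indexability: 'Non-Indexable', 'Content Type': 'text/html' },
  { Adresse: 'https://example.com/de', 'Titel 1': 'Start',
    'Status-Code': 200, Indexierbarkeit: 'Indexierbar', Inhaltstyp: 'text/html' },
];

console.log(buildLlmsTxt('Example', 'A sample site.', sample));
```

With the sample data, the redirected page is dropped, the English row keeps its description, and the German row is emitted without one because no meta description column is present.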

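Features

  • Form trigger that collects the website name, a short description, and the Screaming Frog CSV export
  • Handles column names from English, French, Italian, German, and Spanish exports
  • Keeps only pages with status code 200 that are indexable and served as text/html
  • Optional AI filtering through a Text Classifier node (disabled by default)
  • Produces an llms.txt file that can be downloaded from n8n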

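Technical Analysis

Node Types and Their Roles

  • n8n-nodes-base.stickyNote: in-canvas documentation for each step
  • n8n-nodes-base.set: maps export columns to fields and builds the llms.txt rows and content
  • @n8n/n8n-nodes-langchain.textClassifier: optional classification into useful_content and other_content
  • @n8n/n8n-nodes-langchain.lmChatOpenAi: chat model (gpt-4o-mini) used by the classifier
  • n8n-nodes-base.noOp: placeholder for discarded pages and for the upload step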

Complexity Assessment

Configuration difficulty:
★★★★☆
Maintenance difficulty:
★★☆☆☆
Extensibility:
★★★★☆

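Implementation Guide

Prerequisites

  • Access to an n8n instance
  • A Screaming Frog crawl exported as internal_html.csv (internal_all.csv also works)
  • OpenAI API credentials, only if you enable the Text Classifier node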

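Configuration Steps

  1. Import the workflow JSON file into n8n
  2. Optional: enable the Text Classifier node and attach your OpenAI credentials to the OpenAI Chat Model node
  3. Optional: adjust the category descriptions in the classifier and add conditions to the Filter URLs node
  4. Click "Test Workflow", fill in the form, and upload the CSV export
  5. Download llms.txt from the "Generate llms.txt file" node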

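Key Parameters

Parameter          Default        Description
model              gpt-4o-mini    Chat model used by the optional Text Classifier
acceptFileTypes    .csv           File type the form accepts for the export
fileName           llms.txt       Name of the generated file
encoding           utf8           Text encoding of the generated file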

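Best Practices

Optimization Tips

  • Add filters on word count, URL path, or meta description to control which pages reach llms.txt
  • Edit the classifier's category descriptions to match the kind of pages you want to include
  • Include the website name in the file name when you work on several websites

Security Considerations

  • Store API keys and credentials securely
  • Restrict access to the workflow
  • Review execution logs regularly

Performance

  • Prefer internal_html.csv over internal_all.csv so fewer rows have to be filtered out
  • For very large websites, use a Loop Over Items node to cope with timeouts and API quotas
  • With the classifier enabled, more pages mean more tokens and more billed LLM API calls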

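Troubleshooting

Common Issues

The workflow fails after the extraction step

Check that the uploaded file is a Screaming Frog internal_html.csv or internal_all.csv export. Any other file still passes the extraction step but lacks the columns that later nodes need.

Pages are missing from llms.txt

Review the conditions in the Filter URLs node: only rows with status code 200, an indexable status, and a content type containing text/html are kept.

Debugging Tips

  • Run the workflow step by step to find the failing node
  • Inspect the output of "Set useful fields" to confirm every field was populated
  • Test with a small export before processing a full crawl
  • If the classifier is enabled, check your network connection and the API service status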
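
The workflow's own notes warn that a file other than the expected Screaming Frog export passes the extraction step and then fails later because required fields are missing. A minimal sketch of a pre-flight check on the CSV header names follows. It is an assumption, not part of the workflow: `REQUIRED_COLUMNS` and `missingFields` are hypothetical names, and the column variants are copied from the "Set useful fields" node. Something like it could be adapted to run before the field mapping.

```javascript
// Column-name variants taken from the workflow's "Set useful fields" node.
const REQUIRED_COLUMNS = {
  url: ['Address', 'Adresse', 'Dirección', 'Indirizzo'],
  statut: ['Status Code', 'Code HTTP', 'Status-Code', 'Código de respuesta', 'Codice di stato'],
  indexability: ['Indexability', 'Indexabilité', 'Indicizzabilità', 'Indexabilidad', 'Indexierbarkeit'],
  content_type: ['Content Type', 'Type de contenu', 'Tipo di contenuto', 'Tipo de contenido', 'Inhaltstyp'],
};

// Hypothetical helper: list the fields for which none of the known column names is present.
function missingFields(headerNames) {
  const present = new Set(headerNames);
  return Object.entries(REQUIRED_COLUMNS)
    .filter(([, variants]) => !variants.some((name) => present.has(name)))
    .map(([field]) => field);
}

// Invented header rows: one that looks like a Screaming Frog export, one that does not.
const good = ['Address', 'Content Type', 'Status Code', 'Indexability', 'Title 1'];
const bad = ['Name', 'Email'];

console.log(missingFields(good)); // []
console.log(missingFields(bad));  // [ 'url', 'statut', 'indexability', 'content_type' ]
```

An empty result means every field the filter depends on can be resolved; a non-empty result names the fields that would come through undefined.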

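Error Handling

The workflow JSON defines no dedicated error-handling nodes:

  • A file that is not the expected export does not stop the workflow early; later steps fail because required fields are missing
  • Pages classified as other_content are routed to a No Operation node and dropped
  • For large websites, a Loop Over Items node makes timeouts and API quotas easier to handle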